Goto

Collaborating Authors

 sexual orientation


From Queer-Baiting to Neurodivergence: 'Heated Rivalry's Author Tackles Fan Theories and Controversy

WIRED

"I didn't expect this book to be analyzed like," hockey smut author Rachel Reid tells WIRED. Rachel Reid didn't intend for anyone to write a dissertation about her horny little gay hockey series. Then again, the Nova Scotia author behind the series could never have anticipated the level of fanfare that's accompanied the television adaptation of her books: . The show, commissioned by Canada's Crave and distributed by HBO Max in the US, debuted in late November and quickly became a massive hit. It's the number one Crave original series of all time, and it also climbed to number 1 on HBO Max.


Characterizing Selective Refusal Bias in Large Language Models

Khorramrouz, Adel, Levy, Sharon

arXiv.org Artificial Intelligence

Safety guardrails in large language models(LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, these measures can inadvertently introduce or reflect new biases, as LLMs may refuse to generate harmful content targeting some demographic groups and not others. We explore this selective refusal bias in LLM guardrails through the lens of refusal rates of targeted individual and intersectional demographic groups, types of LLM responses, and length of generated refusals. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. This leads us to investigate additional safety implications via an indirect attack, where we target previously refused groups. Our findings emphasize the need for more equitable and robust performance in safety guardrails across demographic groups.


A word association network methodology for evaluating implicit biases in LLMs compared to humans

Abramski, Katherine, Rossetti, Giulio, Stella, Massimo

arXiv.org Artificial Intelligence

As Large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of LLMs with human cognition. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party. Our results reveal both convergences and divergences between LLM and human biases, providing new perspectives on the potential risks of using LLMs. Our methodology contributes to a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing the goal of transparent and socially responsible language technologies.


A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance

Melis, Matteo, Lapesa, Gabriella, Assenmacher, Dennis

arXiv.org Artificial Intelligence

Detecting harmful content is a crucial task in the landscape of NLP applications for Social Good, with hate speech being one of its most dangerous forms. But what do we mean by hate speech, how can we define it, and how does prompting different definitions of hate speech affect model performance? The contribution of this work is twofold. At the theoretical level, we address the ambiguity surrounding hate speech by collecting and analyzing existing definitions from the literature. We organize these definitions into a taxonomy of 14 Conceptual Elements-building blocks that capture different aspects of hate speech definitions, such as references to the target of hate (individual or groups) or of the potential consequences of it. At the experimental level, we employ the collection of definitions in a systematic zero-shot evaluation of three LLMs, on three hate speech datasets representing different types of data (synthetic, human-in-the-loop, and real-world). We find that choosing different definitions, i.e., definitions with a different degree of specificity in terms of encoded elements, impacts model performance, but this effect is not consistent across all architectures.


Tinder is testing a HEIGHT filter - as devastated users say it's 'over for short men'

Daily Mail - Science & tech

But Tinder has sparked controversy this week, following the launch of its latest feature. The dating app has quietly started testing a height filter. Spotted within the Premium Discovery section of Tinder's Settings, the tool allows users to specify the minimum and maximum heights for their matches. Posting a screenshot to Reddit, user @Extra_Barracudaaaa wrote: 'Oh God. They add a height filter.'


Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study

Ghorbanpour, Faeze, Dementieva, Daryna, Fraser, Alexander

arXiv.org Artificial Intelligence

Despite growing interest in automated hate speech detection, most existing approaches overlook the linguistic diversity of online content. Multilingual instruction-tuned large language models such as LLaMA, Aya, Qwen, and BloomZ offer promising capabilities across languages, but their effectiveness in identifying hate speech through zero-shot and few-shot prompting remains underexplored. This work evaluates LLM prompting-based detection across eight non-English languages, utilizing several prompting techniques and comparing them to fine-tuned encoder models. We show that while zero-shot and few-shot prompting lag behind fine-tuned encoder models on most of the real-world evaluation sets, they achieve better generalization on functional tests for hate speech detection. Our study also reveals that prompt design plays a critical role, with each language often requiring customized prompting techniques to maximize performance.


Mapping the Italian Telegram Ecosystem: Communities, Toxicity, and Hate Speech

Alvisi, Lorenzo, Tardelli, Serena, Tesconi, Maurizio

arXiv.org Artificial Intelligence

Telegram has become a major space for political discourse and alternative media. However, its lack of moderation allows misinformation, extremism, and toxicity to spread. While prior research focused on these particular phenomena or topics, these have mostly been examined separately, and a broader understanding of the Telegram ecosystem is still missing. In this work, we fill this gap by conducting a large-scale analysis of the Italian Telegram sphere, leveraging a dataset of 186 million messages from 13,151 chats collected in 2023. Using network analysis, Large Language Models, and toxicity detection tools, we examine how different thematic communities form, align ideologically, and engage in harmful discourse within the Italian cultural context. Results show strong thematic and ideological homophily. We also identify mixed ideological communities where far-left and far-right rhetoric coexist on particular geopolitical issues. Beyond political analysis, we find that toxicity, rather than being isolated in a few extreme chats, appears widely normalized within highly toxic communities. Moreover, we find that Italian discourse primarily targets Black people, Jews, and gay individuals independently of the topic. Finally, we uncover common trend of intra-national hostility, where Italians often attack other Italians, reflecting regional and intra-regional cultural conflicts that can be traced back to old historical divisions. This study provides the first large-scale mapping of the Italian Telegram ecosystem, offering insights into ideological interactions, toxicity, and identity-targets of hate and contributing to research on online toxicity across different cultural and linguistic contexts on Telegram.


Measuring Online Hate on 4chan using Pre-trained Deep Learning Models

Bermudez-Villalva, Adrian, Mehrnezhad, Maryam, Toreini, Ehsan

arXiv.org Artificial Intelligence

-- Online hate speech can harmfully impact individuals and groups, specifically on non - moderated platforms such as 4chan where users can post anonymous content. This work focuses on analy s ing and measuring the prevalence of online hat e on 4chan's politically incorrect board (/pol/) using state - of - the - art Natural Language Processing (NLP) models, specifically transformer - based models such as RoBERTa and Detoxify . By leveraging these advanced models, we provide an in - depth analysis of hate speech dynamics and quantify the extent of online hate non - moderated platforms. The study advances understanding through multi - class classification of hate speech (racism, sexism, religion, etc.), while also incorporating the classification of toxic content (e.g., identity attacks and threats) and a further topic modelling analysis. The results show that 11.20% of this dataset is identified as containing hate in different categories. These evaluations show that online hate is manifested in various forms, confirming the complicated and volatile nature of detection in the wild. Index Terms -- Hate speech, machine learning, natural language processing (NLP), online hate, toxicity analysis. INTRODUCTION H E SPREAD of hate speech on online platforms has become a serious problem in our society. As digital communication becomes ubiquitous, platforms like 4chan, known for their anonymity and minimal moderation, have become hotspots for this harmful behaviour . This is particularly evident on its politically incorrect board, /pol/, a notorious board dedicated to discussing politics and current events, often associated with hate speech, extremist content, and conspiracy theories [1] . The anonymity provided by these platforms often encourages users to express extreme ideologies [2] . This issue raises significant concerns about the impact on at - risk and vulnerable groups as it can cause real - world harm, including psychological trauma. Therefore, a systematic approach is needed to measure and understand the prevalence and forms of online hate. Received 28 August 2024; revised 23 December 2024, 10 February 2025, and 6 March 2025; accepted 6 March 2025. This work is supported by the UK Research and Innovation (UKRI), through the Strategic Priority Fund as part of the Protecting Citizens Online programme (AGENCY: Assuring Citizen Agency in a World with Complex Online Harms, EP/W032481/2).


Fine-Tuned LLMs are "Time Capsules" for Tracking Societal Bias Through Books

Madhusudan, Sangmitra, Morabito, Robert, Reid, Skye, Sadr, Nikta Gohari, Emami, Ali

arXiv.org Artificial Intelligence

Books, while often rich in cultural insights, can also mirror societal biases of their eras - biases that Large Language Models (LLMs) may learn and perpetuate during training. We introduce a novel method to trace and quantify these biases using fine-tuned LLMs. We develop BookPAGE, a corpus comprising 593 fictional books across seven decades (1950-2019), to track bias evolution. By fine-tuning LLMs on books from each decade and using targeted prompts, we examine shifts in biases related to gender, sexual orientation, race, and religion. Our findings indicate that LLMs trained on decade-specific books manifest biases reflective of their times, with both gradual trends and notable shifts. For example, model responses showed a progressive increase in the portrayal of women in leadership roles (from 8% to 22%) from the 1950s to 2010s, with a significant uptick in the 1990s (from 4% to 12%), possibly aligning with third-wave feminism. Same-sex relationship references increased markedly from the 1980s to 2000s (from 0% to 10%), mirroring growing LGBTQ+ visibility. Concerningly, negative portrayals of Islam rose sharply in the 2000s (26% to 38%), likely reflecting post-9/11 sentiments. Importantly, we demonstrate that these biases stem mainly from the books' content and not the models' architecture or initial training. Our study offers a new perspective on societal bias trends by bridging AI, literary studies, and social science research.


DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?

Khurana, Urja, Nalisnick, Eric, Fokkens, Antske

arXiv.org Artificial Intelligence

When building a predictive model, it is often difficult to ensure that application-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what is considered hate speech, but building a model that reflects their view accurately requires preserving those ideals throughout the workflow of data set construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the application specification and the model's actual behavior upon deployment. To address this issue for hate speech detection, we propose DefVerify: a 3-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to identify the point of failure in the workflow. We use DefVerify to find gaps between definition and model behavior when applied to six popular hate speech benchmark datasets.